Reduce threading scheduler contention for smoothing filter #3280
Merged
Previously, the smoothing filter was dispatched one image line at a time via `ThreadedLoop(..., axes, 1)`. Profiling confirmed that this caused millions of scheduler lock acquisitions for large images. This change uses two inner axes when possible, increasing the chunk size from single lines to small slices of the image. The result is less scheduler contention and lower OS overhead.
clang-tidy review says "All clean, LGTM! 👍"
Lestropie
approved these changes
Mar 20, 2026
Member
Lestropie
left a comment
This is probably the case in multiple other pieces of code also. When I get the chance I'll do a grep search across the repo and up the inner loop axis count for cheap operations. But if you want you can merge this and I'll extend separately.
Member
Author
I think each individual case may be different, so I'll merge this as it is.
It was recently pointed out to me by @Lestropie that `threaded_copy` can be quite slow if the work done per voxel is trivial, because thread management overhead may dominate. This can easily be mitigated by specifying two or more inner axes.

While profiling `mrregister` (for comparison with my GPU registration work), this came to mind when I noticed that `Filter::Smooth::operator()` was in the hot path of the code.

In `smooth.h`, the smoothing is dispatched one image line at a time via `ThreadedLoop(..., axes, 1)`, and it turns out the same idea applies here. Profiling confirmed that this strategy was causing millions of scheduler lock acquisitions for large images when running `mrregister`. This PR substantially improves the situation by using two inner axes when possible to increase chunk size from single lines to small slices of the image. The result is less scheduler contention and lower OS overhead.

The performance improvement can easily be seen on Linux (AMD Ryzen Threadripper PRO 5975WX 32-Cores). Comparing the output of `/usr/bin/time -v command` before and after the change, the time taken for the same command is substantially lower, and the same goes for the OS voluntary/involuntary context switches.
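To make the chunk-size argument concrete, here is a minimal sketch of the arithmetic, not MRtrix3 code. `ChunkPlan` and `plan_chunks` are hypothetical names introduced for illustration. The sketch assumes that the first `num_inner_axes` axes form one work item and that the scheduler hands out one item per combination of the remaining (outer) axes. It ignores stride-based axis reordering and any per-pass details of the real filter.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of MRtrix3): describes how an image would be
// split into work items for a given number of inner axes.
struct ChunkPlan {
  std::size_t voxels_per_chunk;  // work done per scheduler dispatch
  std::size_t num_chunks;        // number of scheduler dispatches
};

// Assumption: axes [0, num_inner_axes) are processed inside a single work
// item, and every remaining axis is iterated by the scheduler.
ChunkPlan plan_chunks(const std::vector<std::size_t>& dims,
                      std::size_t num_inner_axes) {
  assert(num_inner_axes <= dims.size());
  ChunkPlan plan{1, 1};
  for (std::size_t axis = 0; axis < dims.size(); ++axis) {
    if (axis < num_inner_axes)
      plan.voxels_per_chunk *= dims[axis];  // inner axis: grows the chunk
    else
      plan.num_chunks *= dims[axis];        // outer axis: adds dispatches
  }
  return plan;
}
```

For an illustrative 256 × 256 × 192 image (dimensions chosen for the example, not taken from the profiling runs), `plan_chunks({256, 256, 192}, 1)` gives 49,152 dispatches of 256 voxels each, while `plan_chunks({256, 256, 192}, 2)` gives 192 dispatches of 65,536 voxels each. If each dispatch costs at least one scheduler lock acquisition, repeated smoothing calls with one inner axis can plausibly add up to the millions of acquisitions seen in profiling.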